Transcription profiling by RNA-sequencing of normal and cirrhosis liver tissue and Kupffer cells
Introduction
Liver tissue and Kupffer cells from healthy patients and patients with liver cirrhosis were analysed. Both the cell types included 4 samples from patients with cirrhosis (cases) and 5 controls samples.
Data collection (skip if you already have the DGElist file)
# Save experimental data
gsepheno <- pData(gse$GSE123661_series_matrix.txt.gz)
# Download supplementary files with raw count data
getGEOSuppFiles("GSE123661")
# List the file names
files <- untar("./GSE123661/GSE123661_RAW.tar", list =T)
# Unzip tar file
untar("./GSE123661/GSE123661_RAW.tar")
# Read first five lines from the first file
read.delim(files[1], nrow = 5)
Next, we will filter out genes with low cpm values. Genes that are not expressed in any sample will eventually not result in differential expression. Additionally, genes with low or zero cpm values in most of the samples are unlikely to result in differential expression. It is best to remove these genes as these genes will also influence the outcome for multiple testing (= more power after removing them).
Determine how many genes show no expression across all samples?
Remove all genes that do not have a cpm value of at least 0.5 in at least three samples: (A cpm value of 0.5 corresponds to 10 reads for a library size of 20.000.000.)
# in total 20827 genes (interesting + uninteresting genes)
liver.keep.exprs3 <- filterByExpr(dge.liver, group=dge.liver$sampples$group) #smallest group is 4
sum(liver.keep.exprs3) #15051 genes left
## [1] 15051
kupffer.keep.exprs3 <- filterByExpr(dge.kupffer, group=dge.kupffer$sampples$group) #smallest group is 4
sum(kupffer.keep.exprs3) #16996 genes left
## No id variables; using all as measure variables
ggplot(lcpm.filtered.counts.plot.K, aes(x = Expression, colour = Sample)) +
geom_line(stat = 'density')+
labs(title = 'Filtered data Kupffer', x = 'Log-cpm')
Normalisation
Normalisation is a process designed to identify and remove systematic technical differences between samples that occur in the data to ensure that technical bias has minimal impact on the results.
Trimmed Mean of M-values (TMM) (Robinson and Oshlack 2010)
Using the calcNormFactors function from the edgeR library, we normalise our data using TMM normalisation:
The effective library size can be obtained by multiplying the library size with additional normalisation factors. Note that for the most count-based statistical testing procedures, the normalization factors are used directly in the statistical models. In other words, in most cases you can simply supply methods such as EdgeR with the raw count data and the scaling factors.
Genomic datasets are often high-dimensional. Dimensionality reduction analyses provide a way of reducing the complex dataset into a lower dimension (that is more accessible for interpretation) while attempting to retain as much relevant information as possible.
MDS
The plotMDS function performs dimensionality reduction and the distances between the gene expression profiles are visualised. The distances correspond to the leading log-fold-change (if the input data is in the log scale), determined as the root mean square (Euclidean) average of the largest log2-fold changes between each pair of samples (the standard option looks at the top 500 genes with the largest log-fold change for each pair).
# The underlying MDS function is the cmdscale function
# We need the log transformed lcpm counts
mds.liver <- plotMDS(lcpm.norm.L, dim = c(1,8), gene.selection = 'pairwise', plot = FALSE)
mds.kupffer <- plotMDS(lcpm.norm.K, dim = c(1,8), gene.selection = 'pairwise', plot = FALSE)
ggplotMDS <- function(mds, metadata, param, dim1, dim2, label){
# Create MDS plot with ggplot
# Input: mds object
# annotation dataframe
# parameter of interest in the annotation dataframe
# first dimension of interest
# second dimension of interest
# how you want the samples to be labeled
# Extract cmdscale from mds object
dataMDS <- as.data.frame(mds$cmdscale.out)
i <- metadata[colnames(metadata) == param]
l <- metadata[colnames(metadata) == label]
plot <- ggplot(data = dataMDS, aes(x = dataMDS[,dim1] ,
y = dataMDS[,dim2])) +
geom_point(aes(colour = as.factor(i[,1])), size = 2)+
labs(x = paste0("Dimension ", dim1), y = paste0("Dimension ", dim2),
colour = param) +
geom_text(aes(label = l[,1], colour = as.factor(i[,1])),
size = 2.8, vjust = -0.5) +
theme(axis.title.x=element_text(size=11),
axis.title.y=element_text(size=11),
legend.text = element_text(size=10),
legend.title = element_text(size = 11, face = "bold"))
return(plot)
}
We can see that the samples from the kupffer cells are more spread out and there is less separation between the control and case samples. So we expect less significant differential expression compared to the liver tissue samples.
Determine how many genes have an FDR-rate lower than 0.01?
# Number of genes FDR < 0.01
#liver
top.qlf.liver <- topTags(qlf.liver, n=Inf)
sum(top.qlf.liver$table$FDR<0.01) #34 rather low for GSA (ideally 100-500 DE genes), for GSA use genes with FDR >0.05 -> 405 genes
#kupffer
top.qlf.kupffer <- topTags(qlf.kupffer, n=Inf)
sum(top.qlf.kupffer$table$FDR<0.01) # <0.01 -> 0 genes (Top gene FDR = 0.368) already expected less DE genes compared to the liver cells when we saw the MDS plots
conclusion: The liver tissue has 34 significantly DE genes with an FDR < 0.01 –> with GSA we can find out which pathways are over represented. We will use DE genes with an FDR < 0.05 to include enough DE genes (405 genes). For GSA this higher FDR will not be a problem. Kupffer cells show no DE genes (top gene FDR = 0.368). We already expected less DE genes compared to the liver cells when we saw the MDS plots. From here we only work further with the liver tissue cells.
Accounting for confounders
The phenodata didn’t mention any confounders and their research is not available. So unfortunately we could not take into account the effect of confounders like age, gender, alcohol use, etc. on the relation between disease and gene expression.
#save EdgeR DE-output for next step:
#save data needed for GSA as text
write.table(top.qlf.liver$table,"EdgeR_output_Liver.txt",sep="\t",row.names=T)
write.table(top.qlf.kupffer$table,"EdgeR_output_Kupffer.txt",sep="\t",row.names=T) #actually not needed
Gene set analysis
“overrepresentation analysis”: certain GO-term (Gene Ontology term) or a certain pathway is overrepresented amongst our set of interest (e.g. differentially expressed genes) compared to the background. The selection of both this set of interest and the background is of vital importance. Note that selecting a faulty background can already introduce (some) overrepresentation.
To test for overrepresentation, for each term of interest (GO/Pathway) the number of genes in our set of interest and in the background is counted and then tested (typically by chi-squared/Fisher exact test) whether the occurence in the set of interest is significantly different from the occurence in the background. We then correct for multiple testing based on the number of gene sets we analyse.
A FDR cut off of 0.01 gives 34 DE genes. Therefore we chose a FDR cut off of 0.05 which will include 405 DE genes in our GSA set. Once we have have our background genes and gene set of interest saved as a textfile, we can use the webtool Webgestalt to perform GSA.
# Select Gene symbols for the background (i.e. all expressed genes)
Gene.symbol.background <- res.liver$Gene.symbol
Gene.symbol.background <- gsub("\"","",Gene.symbol.background)
write.table(Gene.symbol.background, file="Gene_symbol_background.txt",
col.names=F, quote=F, row.names=F)
# Select Gene symbols for the set of interest
Gene.symbol.sign <- res.liver$Gene.symbol[(res.liver$FDR<0.05)]
Gene.symbol.sign <- Gene.symbol.sign[!is.na(Gene.symbol.sign)]
Gene.symbol.sign <- sort(as.character(Gene.symbol.sign))
length(Gene.symbol.sign)
Now that we have have our background genes and gene set of interest saved as a textfile, we can use the webtool Webgestalt to perform GSA. The geneontology database was selected to find out which biological processes are overrepresented in our gene set of interest.
Interesting list: Gene_symbol_sign_1620406440.txt. ID type: genesymbol
The interesting list contains 405 user IDs in which 379 user IDs are unambiguously mapped to 379 unique entrezgene IDs and 26 user IDs can not be mapped to any entrezgene ID.
The GO Slim summary are based upon the 379 unique entrezgene IDs.
Among 379 unique entrezgene IDs, 307 IDs are annotated to the selected functional categories and also in the reference list, which are used for the enrichment analysis.
Reference list: uploads/Gene_symbol_background_1620406440.txt ID type: genesymbol
The reference list can be mapped to 13913 entrezgene IDs and 11918 IDs are annotated to the selected functional categories that are used as the reference for the enrichment analysis.
Parameters for the enrichment analysis:
Minimum number of IDs in the category: 5
Maximum number of IDs in the category: 2000
FDR Method: BH
Significance Level: Top 10
Based on the above parameters, 10 categories are identified as enriched categories and all are shown in this report.
GO Slim summary for the user uploaded IDs
Each Biological Process, Cellular Component and Molecular Function category is represented by a red, blue and green bar, repectively.
The height of the bar represents the number of IDs in the user list and also in the category.
Enrichment Results
Redundancy reduction:
All
Affinity propagation
Weighted set cover